What happens when we add more predictors to a model?
A simple example using polynomial regression.
The more predictors we include, the more variance we can explain.
However, the more predictors and complexity we include, the more overfitted the model becomes.
Code
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1030)

xsquared <- function(x) {
  x^2
}

# Generate x-y data
sim_data <- function(xsquared, sample_size = 100) {
  x <- runif(n = sample_size, min = 0, max = 1)
  y <- rnorm(n = sample_size, mean = xsquared(x), sd = 0.05)
  data.frame(x, y)
}

# Generate predicted data (model)
df <- sim_data(xsquared, sample_size = 60)
fit <- lm(y ~ 1, data = df)
fit_1 <- lm(y ~ poly(x, degree = 1), data = df)
fit_2 <- lm(y ~ poly(x, degree = 2), data = df)
fit_many <- lm(y ~ poly(x, degree = 20), data = df)
truth <- seq(from = 0, to = 1, by = 0.01)

# Combine the data and model fits into a single data frame
df <- data.frame(
  x = df$x,
  y = df$y,
  fit = predict(fit),
  fit_1 = predict(fit_1),
  fit_2 = predict(fit_2),
  fit_many = predict(fit_many)
)

# Reshape the data frame into long format
df_long <- pivot_longer(
  df,
  cols = starts_with("fit"),  # includes the intercept-only "fit" column
  names_to = "model",
  values_to = "value"
) %>%
  mutate(model = case_when(
    model == "fit" ~ "y = b",
    model == "fit_1" ~ "y = b + mx",
    model == "fit_2" ~ "y = b + mx + nx^2",
    model == "fit_many" ~ "y = b + mx + nx^2 + ... + zx^20",
    TRUE ~ model
  ))

# Plot
p <- ggplot(df_long, aes(x = x, y = value, color = model)) +
  facet_wrap(~model, ncol = 2, scales = "free") +
  geom_point(aes(y = y), alpha = .4, size = 2) +
  geom_line(linewidth = .9, linetype = 1) +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "none")
p
Variance-bias trade-off
In concept…
As complexity increases, bias (\sum{(O-\bar{P})}) decreases (the mean of a model’s predictions is closer to the true mean).
As complexity increases, prediction variance (\frac{\sum{(P-\bar{P})}^2}{n}) increases.
The goal is to find a model that isn’t too simple or complex with a good balance between bias and variance.
\text{Mean Squared Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
In practice, the math and relationships are a bit more irregular.
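The trade-off can be illustrated with a small simulation. This is a minimal sketch (not the lecture code) that reuses the x^2 truth and noise level from above and compares a linear fit against a degree-15 polynomial at a single test point:

```r
# Sketch: estimate squared bias and prediction variance at x = 0.5
# for a simple (linear) vs. complex (degree-15) model over many
# simulated training sets.
set.seed(1)
truth <- function(x) x^2
x_test <- data.frame(x = 0.5)
preds <- sapply(1:200, function(i) {
  x <- runif(50)
  y <- truth(x) + rnorm(50, sd = 0.05)
  d <- data.frame(x, y)
  c(simple  = unname(predict(lm(y ~ x, data = d), newdata = x_test)),
    complex = unname(predict(lm(y ~ poly(x, 15), data = d), newdata = x_test)))
})
bias2    <- (rowMeans(preds) - truth(0.5))^2  # squared bias per model
variance <- apply(preds, 1, var)              # prediction variance per model
round(rbind(bias2, variance), 5)
```

The complex model has lower bias but higher prediction variance, matching the two bullets above.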
How do we determine the best model?
Some model quality measures you should be familiar with:
R^2: the proportion of variance explained by the model, with a maximum value of 1 = 100%
Residual standard error: the typical size of the residuals, i.e. how far observed values fall from the predicted/fitted values (the line of best fit).
Partial F-test: compares the full model to a reduced model; works well when the number of predictors is small and the models are simple.
Some other commonly used measures:
Information criteria: AIC, BIC, etc. (more on this later).
Error measures: useful when the aim of the model is prediction. The best model has the smallest residual error (or other similar metric).
What models do we try?
Everything: all possible combinations of predictors.
Each variable can either be included or excluded so the number of possible combinations is 2^n
e.g. 3 predictors (x_1, x_2, x_3) could have 8 models
No variables, i.e. the mean of y (the null, intercept-only model)
1 variable: x_1; x_2; x_3
2 variables: x_1 + x_2; x_1 + x_3; x_2 + x_3
3 variables: x_1 + x_2 + x_3
So…not recommended
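The enumeration above can be sketched directly. A small illustration with hypothetical predictor names x1, x2, x3:

```r
# Sketch: list all 2^3 = 8 candidate models for three predictors,
# where each predictor is either included (TRUE) or excluded (FALSE).
predictors <- c("x1", "x2", "x3")
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(predictors)))
formulas <- apply(grid, 1, function(keep) {
  rhs <- predictors[keep]
  if (length(rhs) == 0) "y ~ 1" else paste("y ~", paste(rhs, collapse = " + "))
})
length(formulas)  # 8
formulas
```

With 10 predictors this grows to 2^10 = 1024 models, which is why fitting everything quickly becomes impractical.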
Stepwise regression: add/remove predictors one at a time until no single addition or removal improves the model.
Select meaningful predictors based on domain knowledge, correlation, or significance.
More complex approaches are available for big data and machine learning.
Air quality: can we reduce the number of predictors?
Full model:
Code
full_fit <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
summary(full_fit)
Call:
lm(formula = log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-2.06193 -0.29970 -0.00231 0.30756 1.23578
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2621323 0.5535669 -0.474 0.636798
Temp 0.0491711 0.0060875 8.077 1.07e-12 ***
Solar.R 0.0025152 0.0005567 4.518 1.62e-05 ***
Wind -0.0615625 0.0157130 -3.918 0.000158 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5086 on 107 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6644, Adjusted R-squared: 0.655
F-statistic: 70.62 on 3 and 107 DF, p-value: < 2.2e-16
Wind has the highest p-value; can we remove it?
Full model: Multiple R-squared = 0.66, Adjusted R-squared = 0.66
Reduced model: take out Wind
Code
reduced_fit <- lm(log(Ozone) ~ Temp + Solar.R, data = airquality)
summary(reduced_fit)
Call:
lm(formula = log(Ozone) ~ Temp + Solar.R, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-1.83864 -0.33727 0.03444 0.29877 1.38210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.7646527 0.4249016 -4.153 6.58e-05 ***
Temp 0.0607386 0.0056663 10.719 < 2e-16 ***
Solar.R 0.0024651 0.0005924 4.161 6.38e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5413 on 108 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6163, Adjusted R-squared: 0.6092
F-statistic: 86.73 on 2 and 108 DF, p-value: < 2.2e-16
\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} where n is the number of observations and p is the number of predictors.
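As a check, this formula can be verified by hand against summary()'s output for the reduced airquality model:

```r
# Verify adjusted R^2 by hand for the reduced model
reduced_fit <- lm(log(Ozone) ~ Temp + Solar.R, data = airquality)
s <- summary(reduced_fit)
n <- length(residuals(reduced_fit))  # observations actually used (NAs dropped)
p <- 2                               # number of predictors
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj_by_hand, s$adj.r.squared)  # TRUE
```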
Partial F-test
How much of an improvement in adjusted R^2 is worth having an extra variable / more complex model?
We can perform a hypothesis test to determine whether the improvement is significant.
The F-test compares the full model against an intercept-only model in terms of explained variance (residual sum of squares).
The partial F-test compares the full model to a reduced model in terms of the trade-off between model complexity and variance explained (i.e. adjusted R^2).
H_0: no significant difference between the full and reduced models
H_1: the full model is significantly better than the reduced model
anova(reduced_fit, full_fit) # reduced goes first else will see negative df
Analysis of Variance Table
Model 1: log(Ozone) ~ Temp + Solar.R
Model 2: log(Ozone) ~ Temp + Solar.R + Wind
Res.Df RSS Df Sum of Sq F Pr(>F)
1 108 31.645
2 107 27.675 1 3.9703 15.35 0.0001577 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The partial F-test is significant (p-value < 0.05), so we reject the null hypothesis and conclude that the full model is significantly better, even though including Wind only improves the adjusted R^2 by about 4% (0.655 vs 0.609).
But wait…
Looking back at the full model, the p-value for Wind’s coefficient (0.000158) is the same as the partial F-test p-value — the t-test on a single coefficient is equivalent to a partial F-test that removes just that predictor!
Code
anova(reduced_fit, full_fit) # partial F-test
Analysis of Variance Table
Model 1: log(Ozone) ~ Temp + Solar.R
Model 2: log(Ozone) ~ Temp + Solar.R + Wind
Res.Df RSS Df Sum of Sq F Pr(>F)
1 108 31.645
2 107 27.675 1 3.9703 15.35 0.0001577 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
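The equivalence is exact for a single predictor: the squared t statistic equals the partial F statistic. A quick check:

```r
# The t-test for Wind's coefficient and the partial F-test dropping Wind
# are equivalent: t^2 equals the partial F statistic.
full_fit    <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
reduced_fit <- lm(log(Ozone) ~ Temp + Solar.R, data = airquality)
t_wind <- summary(full_fit)$coefficients["Wind", "t value"]
f_part <- anova(reduced_fit, full_fit)$F[2]
all.equal(t_wind^2, f_part)  # TRUE: (-3.918)^2 = 15.35
```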
loyn_fit <- lm(ABUND ~ YR.ISOL + GRAZE + ALT + AREA + LDIST + DIST, data = loyn)
summary(loyn_fit)
Call:
lm(formula = ABUND ~ YR.ISOL + GRAZE + ALT + AREA + LDIST + DIST,
data = loyn)
Residuals:
Min 1Q Median 3Q Max
-17.6638 -4.6409 -0.0883 4.2858 20.1042
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.097e+02 1.133e+02 -0.968 0.33791
YR.ISOL 6.693e-02 5.684e-02 1.177 0.24472
GRAZE -3.447e+00 1.107e+00 -3.114 0.00308 **
ALT 4.772e-02 3.089e-02 1.545 0.12878
AREA 8.866e-04 4.657e-03 0.190 0.84980
LDIST 1.418e-03 1.310e-03 1.082 0.28451
DIST 3.811e-03 5.418e-03 0.703 0.48514
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.947 on 49 degrees of freedom
Multiple R-squared: 0.5118, Adjusted R-squared: 0.452
F-statistic: 8.561 on 6 and 49 DF, p-value: 2.24e-06
Code
performance::check_model(loyn_fit, check =c("linearity", "qq", "homogeneity", "outliers")) # check specific assumptions
Checking assumptions - transformation
Code
loyn_fit <- lm(ABUND ~ YR.ISOL + GRAZE + ALT + AREA_L10 + LDIST_L10 + DIST_L10, data = loyn)
summary(loyn_fit)
Call:
lm(formula = ABUND ~ YR.ISOL + GRAZE + ALT + AREA_L10 + LDIST_L10 +
DIST_L10, data = loyn)
Residuals:
Min 1Q Median 3Q Max
-15.6506 -2.9390 0.5289 2.5353 15.2842
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -125.69725 91.69228 -1.371 0.1767
YR.ISOL 0.07387 0.04520 1.634 0.1086
GRAZE -1.66774 0.92993 -1.793 0.0791 .
ALT 0.01951 0.02396 0.814 0.4195
AREA_L10 7.47023 1.46489 5.099 5.49e-06 ***
LDIST_L10 -0.64842 2.12270 -0.305 0.7613
DIST_L10 -0.90696 2.67572 -0.339 0.7361
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.384 on 49 degrees of freedom
Multiple R-squared: 0.6849, Adjusted R-squared: 0.6464
F-statistic: 17.75 on 6 and 49 DF, p-value: 8.443e-11
Code
performance::check_model(loyn_fit, check =c("linearity", "qq", "homogeneity", "outliers")) # check specific assumptions
Other Assumptions
LINE… + leverage + collinearity
Leverage
The leverage plot shows the influence of each observation (i.e. point) on the model.
Points with high leverage can have a large effect on the model when removed.
Identified by the Cook’s distance statistic – named after the American statistician R. Dennis Cook, who introduced the concept in 1977.
Tip
The leverage plot is a useful tool for identifying outliers and influential points, but can also be used to check for other issues such as heteroskedasticity (unequal variances) and non-linearity!
Reading the leverage plot
Code
par(mfrow = c(1, 2))
plot(loyn_fit, which = c(4, 5))
Visually, points with Cook’s distance > 0.5 are considered influential by default, but this is a somewhat arbitrary threshold.
In practice, you should use a threshold that is appropriate for your data and model.
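Influential points can also be flagged numerically. A sketch using the built-in airquality model (so the code is self-contained; the 4/n cutoff is just one common rule of thumb):

```r
# Flag influential observations by Cook's distance
fit <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
d <- cooks.distance(fit)
which(d > 0.5)            # the default "influential" threshold (often none)
which(d > 4 / length(d))  # a stricter rule of thumb: 4/n
```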
Adj. R-squared varies with addition of predictors.
The problem
Other combinations of predictors exist but are not shown.
We need an automated way to select the best model – 6 predictors gives us 2^6 = 64 models to choose from!
Options:
Backward elimination
Forward selection
Combination (R default)
Backward elimination
Steps for backward elimination
Start with full model.
For each predictor, test the effect of its removal on the model fit.
Remove the predictor whose removal has the least effect on the model fit (i.e. the least informative predictor), unless it is still supplying significant information about the response.
Repeat steps 2 and 3 until no predictors can be removed without significantly affecting the model fit.
In backward elimination, the model fit is assessed using the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). Here we focus on the AIC.
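The steps above can be automated with R’s step() function. A sketch using the airquality model (so the code is self-contained):

```r
# Backward elimination by AIC: step() repeatedly drops the predictor
# whose removal lowers AIC most, stopping when no removal helps.
full_fit <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
back_fit <- step(full_fit, direction = "backward", trace = 0)
formula(back_fit)  # the predictors that survive elimination
```

Setting trace = 1 (the default) prints each elimination step with its AIC, which is useful for seeing the procedure in action.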
About AIC
Most popular model selection criterion (can be used for non-nested models)
Developed by Hirotsugu Akaike under the name of “an information criterion” (AIC)
Founded on information theory which is concerned with the transmission, processing, utilization, and extraction of information.
AIC = 2k - 2\ln(L)
About AIC
AIC = 2k - 2\ln(L)
k is the number of parameters in the model (predictors + intercept)
L is the maximum value of the likelihood function
If we predict using the (current) model, what is the probability density of the prediction compared to the original distribution?
ln(L) = goodness of fit (higher is better)
AIC penalises the number of parameters in the model (the 2k term) against the goodness of fit (the 2\ln(L) term)
The smaller the AIC, the better the model fits the data.
A relative, unitless measure, so a single AIC value is not worth interpreting on its own.
Can be calculated with the AIC() function in R
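For example, comparing the earlier airquality models (smaller AIC is better):

```r
# Compare the full and reduced airquality models by AIC
full_fit    <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
reduced_fit <- lm(log(Ozone) ~ Temp + Solar.R, data = airquality)
AIC(full_fit, reduced_fit)  # the full model has the lower AIC here
```

This agrees with the partial F-test earlier: the extra Wind term is worth its added complexity.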
About AIC - FYI
In linear regression, AIC is sometimes calculated as:
AIC = n\log(\frac{RSS}{n}) + 2k
where RSS is the residual sum of squares, k is the number of parameters in the model, and n is the number of observations.
For a fixed dataset, the two equations differ only by an additive constant, so they rank models identically.
The step() function in R uses this equation (hence you may notice a difference from the values returned by AIC()!)
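A quick check that the two formulas differ only by a constant on a fixed dataset, using the airquality models from earlier (extractAIC() is the RSS-based version that step() uses internally):

```r
# AIC() and extractAIC() differ by the same additive constant for
# every model fitted to the same data, so model rankings agree.
full_fit    <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
reduced_fit <- lm(log(Ozone) ~ Temp + Solar.R, data = airquality)
AIC(full_fit)    - extractAIC(full_fit)[2]     # same constant...
AIC(reduced_fit) - extractAIC(reduced_fit)[2]  # ...for both models
```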